We consider the problem of learning the preferences of a heterogeneous population by observing choices 
from an assortment of products, ads, or other offerings. Our observation model takes a form common in 
assortment planning applications: each arriving customer is offered an assortment consisting of a subset of 
all possible offerings; we observe only the assortment and the customer’s single choice. 

In this paper we propose a mixture choice model with a natural underlying low-dimensional structure, 
and show how to estimate its parameters. In our model, the preferences of each customer or segment follow 
a separate parametric choice model, but the underlying structure of these parameters over all the models 
has low dimension. We show that a nuclear-norm regularized maximum likelihood estimator can learn the 
preferences of all customers using a number of observations much smaller than the number of item-customer 
combinations. This result shows the potential for structural assumptions to speed up learning and improve 
revenues in assortment planning and customization. We provide a specialized factored gradient descent 
algorithm and study the success of the approach empirically. 

CCS Concepts: •Information systems → Learning to rank; Recommender systems; •Computing 
methodologies → Factorization methods; Learning from implicit feedback; •Mathematics of 
computing → Convex optimization; Maximum likelihood estimation; 

Additional Key Words and Phrases: Personalization, Assortment planning, Discrete choice, 
High-dimensional learning, Large-scale learning, First-order optimization, Recommender systems, Matrix completion 


1. INTRODUCTION 

In many commerce and e-commerce settings, a firm chooses a set of items (products, 
ads, or other offerings) to present to a customer, who then chooses from among these 
items. Each choice results in some revenue for the firm that depends on the item selected. 
The problem of choosing the revenue-maximizing assortments of items to offer 
to the customer is referred to as assortment planning or assortment optimization. To 
solve this problem, a firm must estimate customer preferences, and then choose an 
optimal assortment of goods to present based on those estimates. 

Usually the number of interactions between the firm and customer is limited, so 
efficient estimation of customer preferences is critical. But estimating customer preferences 
is no easy task: there are combinatorially many assortments of goods, and so 
potentially combinatorially many quantities to estimate. To enable tractable estimation, 
customer preferences are often modeled parametrically, using the multinomial 
logit (MNL) model or its variants. 
However, one parameter vector may not fit everyone in a heterogeneous population. 
For a better fit, it can be important to segment the population into groups (geographic, 
demographic, or temporal) and to fit parameters for each segment. In the limit, each 
customer may represent a separate segment. In the e-commerce and online advertising 
settings, populations can be segmented into individual customers, since firms have 
data on the individual customers’ choices from individually customized assortments. In 
the offline brick-and-mortar retail setting, populations may be segmented by store 
branch, since firms have data on aggregate choices in each store branch and tastes 
often vary geographically and by venue (mall, street, etc.). Customers may also be 
segmented into smaller groups using loyalty program data. 

The number of observations needed to estimate the parameters of the model for each 
segment is at least linear in the number of items. But in modern advertisement markets 
or online retail settings, tens of millions of items may be on offer: far more than 
the number of ads any one customer can be expected to view or click, or products any 
one customer can be expected to consider or buy. Moreover, the sort of data available 
in practice is limited to observation of a single choice of a single item out of a whole 
assortment, rather than a full ranking or a pairwise comparison. 

In this paper we propose a new model that enables tractable estimation as the number 
of segments and items grows. We suppose the preferences of each type (individual 
customer or customer segment) follow a parametric (MNL) choice model, while the 
underlying structure of these parameters over types has a low dimension. We show 
that a nuclear-norm regularized maximum likelihood estimator based on such data 
can learn the preferences of all customers using a number of observations that grows 
sublinearly in the number of type-item combinations. This result shows the potential 
for structural assumptions to speed up learning and improve revenue in assortment 
planning and customization. 

1.1. Related Work 

The model presented in this paper builds on two distinct lines of research: on choice 
modelling and assortment optimization, and on low-rank matrix completion. 

Choice Modelling and Assortment Optimization. Assortment optimization is a central 
problem in revenue management: which items should be presented to a given 
customer in order to maximize the expected revenue from the customer’s choice? 

The first step in answering this question is to understand how customers choose 
from among a set of goods. Discrete choice models posit answers to this question in the 
form of a probability distribution over choices. Luce [1959] proposed an early discrete 
choice model based on an axiomatic theory, resulting in the basic attraction model. 
Later, the work of McFadden [1973] on random utility theory led to the MNL model, 
which posits that customer choices follow a log-linear model in a vector of customer 
preference parameters. Fitting a single MNL model is as simple as counting the number 
of times an item is chosen relative to the other offerings. (These counts give the 
maximum likelihood estimate for the model.) Under the MNL model, it is easy to optimize 
assortments: an optimal assortment is revenue-ordered, i.e., it consists of the 
highest-revenue items up to some cutoff [Talluri and Van Ryzin 2006]. 
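
The revenue-ordered property can be illustrated with a minimal sketch. It assumes, beyond what the text states, an outside (no-purchase) option with attraction weight 1, and it takes the item attraction weights as given inputs; all function names are hypothetical. The sketch compares a search over the n revenue-ordered assortments with brute-force enumeration on a toy instance.

```python
from itertools import combinations


def expected_revenue(assortment, weights, revenues):
    """Expected revenue under MNL with a no-purchase option of weight 1 (assumed)."""
    total = 1.0 + sum(weights[j] for j in assortment)
    return sum(revenues[j] * weights[j] for j in assortment) / total


def best_revenue_ordered(weights, revenues):
    """Search only the n assortments made of the top-k items by revenue."""
    order = sorted(range(len(revenues)), key=lambda j: -revenues[j])
    prefixes = [order[:k] for k in range(1, len(order) + 1)]
    return max(prefixes, key=lambda s: expected_revenue(s, weights, revenues))


def best_by_enumeration(weights, revenues):
    """Brute force over all nonempty subsets (only feasible for tiny n)."""
    n = len(revenues)
    subsets = [list(c) for k in range(1, n + 1) for c in combinations(range(n), k)]
    return max(subsets, key=lambda s: expected_revenue(s, weights, revenues))


# Toy instance (illustrative numbers, not from the paper).
weights = [0.5, 1.2, 2.0, 0.8]
revenues = [10.0, 6.0, 3.0, 8.0]
ordered = best_revenue_ordered(weights, revenues)
brute = best_by_enumeration(weights, revenues)
```

On this instance both searches select the three highest-revenue items, while the revenue-ordered search evaluates 4 assortments instead of 15.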

Conceiving of assortments as arms and consumer choice as bandit feedback, Rusmevichientong 
et al. [2010] and Saure and Zeevi [2013] consider dynamic assortment 
optimization under a single MNL model. The former also present a polynomial-time 
algorithm for optimizing assortments under an MNL model with cardinality constraints. 

The mixture of MNLs (MMNL) model represents consumer choice as a mixture of MNL 
models with different parameters. Unfortunately, it is NP-hard to optimize a single 
assortment to be offered to an MMNL population, even with only two mixture components 
[Rusmevichientong et al. 2014]. Other derivatives of the MNL model include the 
nested logit model [Williams 1977] and its extensions [McFadden 1980]. Assortment 
optimization over these is also computationally hard [Davis et al. 2014]. 

Matrix Completion. Recent years have seen a surge of interest in matrix completion: 
the problem of (approximately) recovering an (approximately) low rank matrix from a 
few (noisy) samples from its values. The surprising result is that simple algorithms, 
such as nuclear norm regularized maximum likelihood estimation, can often recover a 
low rank matrix given only a small number of observations in each row or column. 

Following groundbreaking work on exact completion of exactly low rank matrices 
whose entries are observed without noise [Candes and Recht 2009; Candes and Tao 
2010; Keshavan et al. 2010; Recht et al. 2010], approximate recovery results have 
been obtained for a variety of different noisy observation models. These include observations 
with additive Gaussian [Candes and Plan 2009] and subgaussian [Keshavan 
et al. 2009a] noise, 0-1 (Bernoulli) observations [Davenport et al. 2012], observations 
from any exponential family distribution [Gunasekar et al. 2014], and observations 
generated according to the Bradley-Terry-Luce model for pairwise comparisons [Lu 
and Negahban 2014]. 

Our results follow in the vein of the statistical matrix completion bounds. The proof 
of our main result uses the machinery of restricted strong convexity developed by Negahban 
and Wainwright [2012], and echoes many of the ideas in the technical report by 
Lu and Negahban [2014], which proved a similar sample complexity result for matrix 
completion from observations of pairwise ranks. Our method extends these ideas to 
address observations consisting only of a single choice from an arbitrarily-sized subset 
of all items. Most recently (and independently of our work), Oh et al. [2015] extended 
the results of Lu and Negahban [2014] to observations of the ranking of all items in a 
subset, rather than the (pairwise) ranking of items in a subset of size two. A primary 
distinguishing feature of our work is the observation model we consider — observing 
choices, rather than rankings — which applies naturally to the type of data available in 
realistic applications of assortment planning. This type of data is much more common 
in practice because it corresponds to passive observations of consumer behavior that 
is truthful (assuming choice is utility maximizing) and it is available in large amounts 
because it can be derived from transactional data. We also use weaker assumptions on 
the distribution of assortment sets offered: in particular, Oh et al. [2015] require that 
the sets contain duplicate items (with nonzero probability), where duplicated items are 
chosen with higher frequency, violating independence of irrelevant alternatives. 

1.2. Contributions 

This paper makes four main contributions. 

— We propose the low rank MMNL model for customer preferences: the preferences 
of each type follow a parametric (MNL) choice model, but the underlying latent 
structure of parameters over types has low dimension. 

— We consider the problem of learning such a choice model from observations of 
choices from assortments and propose a nuclear-norm regularized maximum likelihood 
estimator. 

— Theorem 3.1 provides the first bound on sample complexity of learning this model 
from choice data. 

— Algorithm 1 provides a fast method for computing our estimator with a small memory 
footprint. 


2. PROBLEM STATEMENT AND ALGORITHM 

We suppose that at each time $t = 1, \dots, T$ a customer arrives with type $i_t$ chosen 
uniformly at random from the set $\{1, \dots, m\}$. The customer is presented with a choice 
of items $S_t \subset \{1, \dots, n\}$ of size $|S_t| = K_t$, where $S_t \mid K_t$ is sampled uniformly from the 
set of subsets of $\{1, \dots, n\}$ of size $K_t$. We make no assumption on the distribution of $K_t$ 
other than that $K_t \le K$ is bounded almost surely. 

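The customer then chooses an item according to a multinomial logit model: item 
$j_t \in S_t$ is chosen with probability 

$\mathbb{P}(j_t = j \mid i_t = i, S_t = S) = \dfrac{e^{-\Theta^\star_{ij}}}{\sum_{j' \in S} e^{-\Theta^\star_{ij'}}}$.   (1) 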


The standing assumption is that $\Theta^\star$ is of low underlying dimension, either having low 
rank $r \ll m, n$ or approximately low rank (see below for details). 

After $T$ observations $(i_t, j_t, S_t)$ from this model, we wish to estimate the parameter 
matrix $\Theta^\star$. To eliminate redundant degrees of freedom, we assume without loss of 
generality that $\sum_{j=1}^n \Theta^\star_{ij} = 0$ for every $i = 1, \dots, m$, i.e., $\Theta^\star e = 0$. We also assume that 
$\|\Theta^\star\|_\infty \le \alpha/\sqrt{mn}$ for purely technical reasons; see below. This assumption is standard 
in matrix completion recovery results. 
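
The row-sum normalization loses no generality because MNL choice probabilities do not change when a constant is added to every entry of a row. The sketch below checks this numerically. It assumes choice probabilities proportional to $e^{-\Theta_{ij}}$, which matches the utility convention $u_{ij} = -\Theta^\star_{ij}$ used in the Discussion; `choice_probs` and `center_row` are hypothetical helper names.

```python
import math


def choice_probs(theta_row, assortment):
    """MNL probabilities over `assortment`, proportional to exp(-theta) (assumed sign)."""
    lo = min(theta_row[j] for j in assortment)  # shift for numerical stability
    w = {j: math.exp(-(theta_row[j] - lo)) for j in assortment}
    z = sum(w.values())
    return {j: w[j] / z for j in assortment}


def center_row(theta_row):
    """Subtract the row mean so that the row sums to zero."""
    mean = sum(theta_row) / len(theta_row)
    return [x - mean for x in theta_row]


row = [0.3, -1.0, 2.5, 0.7]          # illustrative parameters for one type
centered = center_row(row)
p_raw = choice_probs(row, [0, 2, 3])
p_centered = choice_probs(centered, [0, 2, 3])
```

Both parameter rows induce the same choice distribution on every assortment, so only the centered representative needs to be estimated.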

Discussion. Model (1) describes an MMNL over mixture components indexed by i. 
By the approximation results of McFadden and Train [2000], for any choice distribution 
over a population, there is a variable I such that the choice distribution is 
approximately MNL conditioned on I. Hence the model above can represent any choice 
distribution so long as the population is segmented finely enough. 

The MMNL model is equivalent to a utility maximizing model under a particular 
distribution of customer utilities. Define $u_{ij} = -\Theta^\star_{ij}$ to be the nominal utility of type $i$ 
for item $j$. Suppose that a random customer of type $i$ has utility $u_{ij} + \zeta_{ij}$, where $\zeta_{ij} \sim 
\mathrm{Gumbel}(0,1)$ is a random idiosyncratic deviation from the nominal utility distributed 
according to the extreme value distribution. When presented with an assortment, each 
customer simply chooses the product $j_t$ maximizing her own utility function: 

$j_t = \arg\max_{j \in S_t} \,(u_{ij} + \zeta_{ij})$.   (2) 
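
The equivalence between Gumbel-perturbed utility maximization and MNL choice can be checked by simulation. The following is a minimal sketch, assuming standard Gumbel(0, 1) noise drawn by inverse transform and nominal utilities equal to the negated parameters; the function names, the seed, and the numbers are illustrative and not taken from the paper.

```python
import math
import random


def gumbel(rng):
    """Standard Gumbel(0, 1) draw via inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def simulate_choice(utilities, assortment, rng):
    """Pick the item in the assortment with the largest perturbed utility."""
    return max(assortment, key=lambda j: utilities[j] + gumbel(rng))


def mnl_probs(utilities, assortment):
    """Closed-form MNL probabilities, proportional to exp(utility)."""
    z = sum(math.exp(utilities[j]) for j in assortment)
    return {j: math.exp(utilities[j]) / z for j in assortment}


rng = random.Random(0)
theta_row = [0.5, -0.2, 1.0, 0.0]
utilities = [-x for x in theta_row]   # nominal utility is the negated parameter
assortment = [0, 1, 3]
n_draws = 20000
counts = {j: 0 for j in assortment}
for _ in range(n_draws):
    counts[simulate_choice(utilities, assortment, rng)] += 1
freqs = {j: counts[j] / n_draws for j in assortment}
probs = mnl_probs(utilities, assortment)
```

The empirical frequencies in `freqs` should approach the closed-form values in `probs` as the number of draws grows.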


Our mathematical model (1) can be applied in two conceptually distinct problem 
settings. In the first setting, each type $i$ represents a group of customers, each one of 
which has an idiosyncratic utility distributed as in (2). We say that the population is 
heterogeneous because each type $i$ is associated with a different nominal utility $u_{ij}$ for 
each product $j$. In the second setting, each type $i$ represents a single customer. The 
random idiosyncrasies associated with each choice made by the same customer reflect 
human inconsistencies in decision making, or slight variations in preferences over time 
[DeShazo and Fermo 2002; Kahneman and Tversky 1979]. 

The observation model we propose has several advantages. Our observations consist 
of customers’ choices from assortments. These choices are typically the only observation 
possible in applications of assortment planning. Moreover, these observations tend 
to be truthful: customers generally make choices to maximize their utility. Observed 
choices should be contrasted with data obtained by interrogating customers about their 
ranking of products (for example, in surveys or focus groups); these stated preferences 
are known to be unreliable as indicators of true, revealed preferences [Samuelson 
1948]. 

One limitation of our results is that we require a somewhat random design (random 
$S_t$) to guarantee that preferences are learned accurately. Hence our results can 
either be interpreted as a prescription on how to design practical, consistent experiments 
in consumer choice where only choices are observed, or they can be interpreted 
as theoretical justification for our algorithm, even if applied to data of general design. 
Indeed, the empirical success of matrix completion approaches on real world data sets 
suggests that the algorithm may work well even when applied to non-random observations 
[Funk 2006]. Other authors have suggested variations on the nuclear norm 
regularizer we propose to compensate for non-uniform sampling distributions [Foygel 
et al. 2011]; generalizing these variations to observations of assortments is an interesting 
problem beyond the scope of this paper. 

We are not the first to model customer preferences as a mixture of MNLs. For example, 
Bernstein et al. [2011] consider the problem of multi-stage assortment optimization 
over time with limited inventories and for consumers from two or more segments, 
each distributed as MNL. They do not consider a learning problem but consider a simulation 
study of their optimization techniques based on MNL distributions estimated 
from real choice data. This estimation method scales poorly with the dimension of the 
problem and so is limited to very few segments (three in their case study). Our model, 
on the other hand, scales easily to very large problems: fixing the rank of the parameter 
matrix, the complexity of the estimation problem increases linearly with the number 
of types and products, rather than the number of their combinations. 

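Algorithm. Define the negative log likelihood of the observations given parameter $\Theta$ 
as 

$L_T(\Theta) = \dfrac{1}{T} \sum_{t=1}^{T} \Big( \Theta_{i_t j_t} + \log \sum_{j \in S_t} e^{-\Theta_{i_t j}} \Big)$. 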
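
A minimal sketch of this objective follows. It assumes choice probabilities proportional to $e^{-\Theta_{ij}}$ (matching $u_{ij} = -\Theta^\star_{ij}$ in the Discussion) and an average over the $T$ observations (matching the averaged sums used in the proofs); the function name and the data layout, a list of `(i, j_chosen, assortment)` tuples with zero-based indices, are hypothetical.

```python
import math


def neg_log_likelihood(theta, observations):
    """Average over observations of theta[i][j_chosen] + log sum_{j in S} exp(-theta[i][j])."""
    total = 0.0
    for i, j_chosen, assortment in observations:
        row = theta[i]
        lo = min(row[j] for j in assortment)  # log-sum-exp shift for stability
        log_z = -lo + math.log(sum(math.exp(-(row[j] - lo)) for j in assortment))
        total += row[j_chosen] + log_z
    return total / len(observations)


# With all parameters equal, each of the 3 offered items is chosen with
# probability 1/3, so the negative log likelihood is log(3).
uniform_nll = neg_log_likelihood([[0.0, 0.0, 0.0]], [(0, 1, [0, 1, 2])])
```

Each term touches only the entries of one row indexed by the offered assortment, so evaluating the objective costs on the order of $TK$ operations rather than $mn$.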

Define the estimator $\hat\Theta$ to be any solution of the nuclear norm regularized maximum 
likelihood problem 

minimize $L_T(\Theta) + \lambda \|\Theta\|_*$, 
subject to $\Theta e = 0$,   (3) 
$\|\Theta\|_\infty \le \alpha/\sqrt{mn}$, 

where $\lambda > 0$ is a parameter, $e$ is the vector of all 1’s, and we define the nuclear norm 
$\|\Theta\|_*$ to be the sum of the singular values of $\Theta$. The constraint $\|\Theta\|_\infty \le \alpha/\sqrt{mn}$ appears 
as an artifact of the proof; this last constraint can be omitted without sacrificing good 
practical performance. See Sec. 6. 

Problem (3) is convex and hence can be solved by a variety of standard methods 
[Boyd and Vandenberghe 2004]. In Sec. 5 we provide a specialized first-order algorithm 
that works on the non-convex, factored form of the problem. 

3. MAIN RESULT 

We bound the error of our estimator (3) in terms of the following quantities, which 
capture the complexity of learning the preferences of all customer types over all items. 

— Number of observations. The bound decreases as the number of observations increases. 

— Number of parameters. The bound grows as d = grows, where 0* consists of 
m X n parameters. 

— Underlying rank dimension. For any $r \le \min(m,n)$, our bound decomposes into 
two error terms. The first error term is the error in estimating the top $r$ “principal 
components” of the parameter matrix. This error grows with $\sqrt{r}$ and captures 
the benefit of learning only the most salient features instead of all parameters at 
once. The second error is the error in approximating the parameter matrix by only 
its top $r$ “principal components.” In particular, if it is assumed that $\Theta^\star$ is exactly 
rank $r$ then this latter error is always zero. More generally, however, we may conceive 
of parameter matrices that are approximately low rank, i.e., that have quickly 
decaying singular values past the top $r$, which would lead to a nonzero but small 
approximation error. 

— Size of parameters. Our bound grows with the (scaled) maximum magnitude of any 
entry, $\alpha$. 

— Size of choice sets. Our bound grows with the maximum size of the choice sets, $K$. 


Theorem 3.1. Suppose T < dPlogd and A = Then under the obser¬ 

vation model described and for any integer r < min (m, n), with probability at least 
1 — Ajdf, any solution 0 to Problem (3) satisfies 


II0 — 0*||f < 2048016^^ max< 


lK^d\og{d) 

T 


fK^d\og{d) 


1/4 


' min{m,n} 

E 

. i=r+l 



4. PROOF OF MAIN RESULT 

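Define $\Delta = \hat\Theta - \Theta^\star$ as the (matrix) error to bound, $K_t = |S_t|$, $S'_t = S_t \setminus \{j_t\}$, 
$\gamma = \alpha/\sqrt{mn}$, $X_{tj} = e_{i_t} e_{j_t}^T - e_{i_t} e_j^T$, and 
$V_t(\Theta) = \operatorname{Var}(\{\Theta_{i_t j} : j \in S_t\}) = \frac{1}{K_t} \sum_{j \in S_t} \big(\Theta_{i_t j} - \frac{1}{K_t} \sum_{j' \in S_t} \Theta_{i_t j'}\big)^2$. 
We use the following three lemmas, whose proofs we defer to Sec. 8. 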

Lemma 4.1. Let 


-4r,.= 0:||0||oo<7, l|e||ir<r, ||0||, < 


2A0VKmnyV d\ogd 


T 


-Tf 0e = 0 


and 


Mr..= sup (ll||0||2-l^y,(0)| . 


Then 


eeAr.^ \ mn 


t=i 


Lemma 4.2. Let 

7l’^=|0:||0|U<7, l|0||*< 


, 8 1^2 p4 

I 


\2d>\/Kmny\l dlogd 


||0|||, 0e = O 


Then 


1 


V0e.4M >l-2d 


- 2 -‘ 


T ■ 


t=i 


2mn 


Lemma 4.3. With probability at least 1 — 2d 

||VLt(0*)||2<16 


K dlogd 


mnT 



















Proof (Theorem 3.1). Let us first assume $\Delta \in \mathcal{A}^*$, and restrict to the high probability 
event that the events in both Lemma 4.2 and Lemma 4.3 occur. 

Define $D = L_T(\hat\Theta) - L_T(\Theta^\star) - \nabla L_T(\Theta^\star) \cdot (\hat\Theta - \Theta^\star)$. By Taylor’s theorem, there exists $s \in [0,1]$ 
such that 


D = v2Lt(0* + sA)[A,A] 


1 

T 


E 


|^ (l + Ejg5 


;e''*0(Ejgs; ■ Af) 

(1 + Ejgs; 



1 

(1 + Ejgsi 


i6s; 


(l + Eigs;e"«)^ ) 


where $v_{tj} = X_{tj} \cdot (\Theta^\star + s\Delta)$ and the last inequality is Jensen’s. Since $\|\Theta^\star\|_\infty, \|\hat\Theta\|_\infty \le \gamma$ 
we have $|v_{tj}| \le 2\gamma$ and since the mean minimizes sum of squared distances, 


D > 





^^(A*,.A)2> 
* i6s; 


111 




t=l 


(4) 


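Let $\Theta^\star = U \operatorname{diag}(\sigma_1, \sigma_2, \dots) V^T$ be the singular-value decomposition (SVD) of $\Theta^\star$ with 
singular values sorted largest to smallest. Using block notation and following Recht 
et al. [2010], let us rewrite $\Delta$ and define $\Delta'$, $\Delta''$: 

$U^T \Delta V = \Gamma = \begin{pmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{pmatrix}$ with $\Gamma_{11} \in \mathbb{R}^{r \times r}$, $\quad \Delta'' = U \begin{pmatrix} 0 & 0 \\ 0 & \Gamma_{22} \end{pmatrix} V^T$, $\quad \Delta' = \Delta - \Delta''$. 

Then $\Delta = U \Gamma V^T = \Delta' + \Delta''$. Note that 

$\operatorname{rank}(\Delta') = \operatorname{rank}(U^T \Delta' V) = \operatorname{rank} \begin{pmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & 0 \end{pmatrix} \le 2r$. 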


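Letting $\Theta^\star_r = U \operatorname{diag}(\sigma_1, \dots, \sigma_r, 0, 0, \dots) V^T$ and its complement $\bar\Theta^\star_r = \Theta^\star - \Theta^\star_r$, we see 

$\|\hat\Theta\|_* = \|\Theta^\star + \Delta\|_* = \|\Theta^\star_r + \bar\Theta^\star_r + \Delta' + \Delta''\|_*$ 
$\ge \|\Theta^\star_r + \Delta''\|_* - \|\bar\Theta^\star_r\|_* - \|\Delta'\|_*$ 
$= \|\Theta^\star_r\|_* + \|\Delta''\|_* - \|\bar\Theta^\star_r\|_* - \|\Delta'\|_*$ 
$= \|\Theta^\star\|_* + \|\Delta''\|_* - 2\|\bar\Theta^\star_r\|_* - \|\Delta'\|_*$, 

and so $\|\Theta^\star\|_* - \|\hat\Theta\|_* \le 2\|\bar\Theta^\star_r\|_* + \|\Delta'\|_* - \|\Delta''\|_*$. 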

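By the optimality of $\hat\Theta$, we have 

$L_T(\hat\Theta) + \lambda \|\hat\Theta\|_* \le L_T(\Theta^\star) + \lambda \|\Theta^\star\|_*$. 

Hence, by Hölder’s inequality, 

$0 \le D = L_T(\hat\Theta) - L_T(\Theta^\star) - \nabla L_T(\Theta^\star) \cdot \Delta$ 
$\le \|\nabla L_T(\Theta^\star)\|_2 \|\Delta\|_* + \lambda(\|\Theta^\star\|_* - \|\hat\Theta\|_*)$.   (5) 

Since $\|\nabla L_T(\Theta^\star)\|_2 \le \lambda$, the triangle inequality in (5) yields 

$D \le \|\nabla L_T(\Theta^\star)\|_2 \|\Delta\|_* + \lambda \|\Delta\|_* \le 2\lambda \|\Delta\|_*$. 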





Together with Lemma 4.2 and eq. (4) (for the lower bound) and our choice of $\lambda$ (for the 
upper bound), this yields 


—5 - 77 ||A||^ < D < 64 

2e®'''mnA ^ 


Kdlogd 

mnT 


Hence, recalling 7 = 


||A||^<128ae^i^3/2yrf|p||^||^^ 
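(6) 

(7) 

Returning to (5), since $\|\nabla L_T(\Theta^\star)\|_2 \le \lambda/2$, we have 

$0 \le \|\nabla L_T(\Theta^\star)\|_2 \|\Delta\|_* + \lambda(\|\Theta^\star\|_* - \|\hat\Theta\|_*)$ 
$\le \lambda \big( \tfrac{1}{2}\|\Delta\|_* + 2\|\bar\Theta^\star_r\|_* + \|\Delta'\|_* - \|\Delta''\|_* \big)$ 
$\le \lambda \big( 2\|\bar\Theta^\star_r\|_* + \tfrac{3}{2}\|\Delta'\|_* - \tfrac{1}{2}\|\Delta''\|_* \big)$, 

and so $\|\Delta''\|_* \le 3\|\Delta'\|_* + 4\|\bar\Theta^\star_r\|_*$. Let $\rho = \|\bar\Theta^\star_r\|_* = \sum_{i=r+1}^{\min\{m,n\}} \sigma_i$. Since $\|\Delta\|_F^2 - \|\Delta'\|_F^2 = 
\|\Gamma_{22}\|_F^2 \ge 0$, 

$\|\Delta\|_* \le \|\Delta'\|_* + \|\Delta''\|_* \le 4\|\Delta'\|_* + 4\rho$ 
$\le 8 \max\{\|\Delta'\|_*, \rho\} \le 8 \max\{\sqrt{2r}\,\|\Delta'\|_F, \rho\}$ 
$\le 16 \max\{\sqrt{r}\,\|\Delta\|_F, \rho\}$.   (8) 

If $\sqrt{r}\,\|\Delta\|_F \ge \rho$, substitute (8) in (7) to see 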


|Al|p < 2048ae^A:^/2 


rd log d 


T 


Otherwise (if $\sqrt{r}\,\|\Delta\|_F < \rho$), substitute (8) in (7) and take the square root to see 


|Al|p < \/2048ae^A:3/2 


pd log d 


T 


< 2048ae 


7 pdlogd \ 


1/4 


Combining yields the statement. 

Our last step is to investigate the case where $\Delta \notin \mathcal{A}^*$, i.e., 


I|A|U> 


1 


128 v^ 


\mn'y 


T 


dlogd 


l|A||^. 


In this case we can’t use (7), which relies on Lemma 4.2. But rewriting and introducing 
redundant terms greater than 1, we see 

||A||2 < 128ae^A:3/2y^^^||A|U, 

Hence we recover (7) whether or not $\Delta \in \mathcal{A}^*$. □ 




















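ALGORITHM 1: Factored Gradient Descent for (9) 

input: dimensions $m$, $n$, $\hat r$, data $\{(i_t, j_t, S_t)\}_{t=1}^T$, regularizing coefficient $\lambda$, and tolerance $\tau$ 

$U \leftarrow U^0$, $V \leftarrow V^0$, $f' \leftarrow \infty$. 

repeat 
    $f \leftarrow f'$ 
    $\Delta U \leftarrow -\lambda U$, $\Delta V \leftarrow -\lambda V$ 
    for $t = 1, \dots, T$ do 
        $W \leftarrow 0$ 
        for $j \in S_t$ do 
            $w_j \leftarrow e^{-U_{i_t}^T V_j}$ 
            $W \leftarrow W + w_j$ 
        end for 
        $\Delta U_{i_t} \leftarrow \Delta U_{i_t} - \frac{1}{T} \big( V_{j_t} - \frac{1}{W} \sum_{j \in S_t} w_j V_j \big)$ 
        $\Delta V \leftarrow \Delta V - \frac{1}{T} \big( e_{j_t} U_{i_t}^T - \frac{1}{W} \sum_{j \in S_t} w_j e_j U_{i_t}^T \big)$ 
    end for 
    repeat 
        $U' \leftarrow U + \eta \Delta U$, $V' \leftarrow V + \eta \Delta V$ 
        $f' \leftarrow L_T(U'V'^T) + \frac{\lambda}{2}\|U'\|_F^2 + \frac{\lambda}{2}\|V'\|_F^2$ 
        $\eta \leftarrow \beta_{\mathrm{dec}}\, \eta$ 
    until $f' < f$ 
    $U \leftarrow U'$, $V \leftarrow V'$ 
until the relative decrease from $f$ to $f'$ is at most $\tau$ 
output: $UV^T - UV^T e e^T / n$ 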


5. A FACTORED GRADIENT DESCENT ALGORITHM 

In this section we provide a factored gradient descent (FGD) algorithm for the problem 

minimize $L_T(\Theta) + \lambda \|\Theta\|_*$, 
subject to $\Theta e = 0$.   (9) 

As discussed in Sec. 2, the algorithm we employ does not enforce the constraint on $\|\Theta\|_\infty$, 
as is common for matrix completion. This constraint is necessary for the technical 
result in our main theorem, but is unnecessary in practice, as can be seen in our numerical 
results in Sec. 6. 

In applications, one is interested in solving the problem (9) for very large $m$, $n$, $T$. 
Due to the complexity of Cholesky factorization, this rules out theoretically-tractable 
second-order interior point methods. One standard approach is to use a first-order 
method, such as those of Cai et al. [2010]; Hazan [2008]; Orabona et al. [2012]; Parikh and 
Boyd [2014]; however, this approach requires (at least a partial) SVD at each step. 
An alternative approach, which we take here, is to optimize as variables the factors 
$U \in \mathbb{R}^{m \times \hat r}$ and $V \in \mathbb{R}^{n \times \hat r}$ of the optimization variable $\Theta = UV^T$ rather than producing 
these via SVD at each step; see, e.g., Jain et al. [2013]; Keshavan et al. [2009b]. To 
guarantee equivalence of the problems, we must take $\hat r = \min(m,n)$. However, if we 
believe the solution is low rank, we may use a smaller $\hat r$, reducing computational work 
and storage. 

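Our FGD algorithm proceeds by applying gradient descent steps to the unconstrained 
problem 

minimize $L_T(UV^T) + \frac{\lambda}{2}\|U\|_F^2 + \frac{\lambda}{2}\|V\|_F^2$, 
subject to $U \in \mathbb{R}^{m \times \hat r}$, $V \in \mathbb{R}^{n \times \hat r}$.   (10) 

Lemma 5.1. Problem (10) is equivalent to Problem (9) subject to the additional 
constraint $\operatorname{rank}(\Theta) \le \hat r$. 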







[Figure: line plot; legend entries n=m=500, n=m=1000, n=m=4000.] 

Fig. 1: RMSE of estimators for $\Theta^\star$ with $m = n = 100, 150, 200, \dots, 500, 750, \dots, 4000$ and 
$r = 2, 25$. 


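Proof. Given $\Theta$ feasible in (9) with $\operatorname{rank}(\Theta) \le \hat r$, write its SVD $\Theta = U \Sigma V^T$, where 
$\Sigma \in \mathbb{R}^{\hat r \times \hat r}$ is diagonal and $U$, $V$ unitary. Letting $\tilde U = U \Sigma^{1/2}$ and $\tilde V = V \Sigma^{1/2}$, we obtain 
a feasible solution to (10) with the same objective value $\Theta$ has in (9). Conversely, given 
$U$, $V$ feasible in (10), let $\Theta = UV^T - UV^T e e^T / n$. Note $\operatorname{rank}(\Theta) \le \hat r$, $\Theta e = 0$, $L_T(UV^T) = 
L_T(\Theta)$ as $L_T(\cdot)$ is invariant to constant shifts to rows, and, writing the SVD $UV^T = \bar U \Sigma \bar V^T$, 

$\|\Theta\|_* \le \|UV^T\|_* = \operatorname{tr}(\Sigma) = \operatorname{tr}(\bar U^T U V^T \bar V)$ 
$\le \|\bar U^T U\|_F \|\bar V^T V\|_F \le \|U\|_F \|V\|_F \le \tfrac{1}{2}\|U\|_F^2 + \tfrac{1}{2}\|V\|_F^2$. 

Hence, $\Theta$ has objective value no worse than $(U, V)$. □ 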

It is easy to compute the gradients of the objective of (10). Since $L_T(\Theta)$ is differentiable, 

$\nabla_U L_T(UV^T) = \nabla L_T(UV^T)\, V$, 
$\nabla_V L_T(UV^T) = \nabla L_T(UV^T)^T U$. 

We do not need to explicitly form $\nabla L_T(UV^T)$ in order to compute these; this observation 
reduces the memory required to implement the algorithm (see Algorithm 1). 
Similarly, we need not form $UV^T$ to compute $L_T(UV^T)$. Recent work has shown that 
gradient descent on the factors converges linearly to the global optimum for problems 
that enjoy restricted strong convexity [Bhojanapalli et al. 2015]. In eq. (6) we establish 
restricted strong convexity for our problem with high probability. 
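
The memory saving can be made concrete with a minimal sketch that evaluates the factored objective and its gradients by touching only the rows of $U$ and $V$ named in each observation. It assumes choice probabilities proportional to $e^{-\Theta_{ij}}$ and an objective averaged over observations; the function names and the data layout, a list of `(i, j_chosen, assortment)` tuples with zero-based indices, are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def objective(U, V, obs, lam):
    """L_T(U V^T) + (lam/2)(||U||_F^2 + ||V||_F^2), never forming U V^T."""
    nll = 0.0
    for i, jc, S in obs:
        scores = {j: dot(U[i], V[j]) for j in S}
        lo = min(scores.values())
        nll += scores[jc] - lo + math.log(sum(math.exp(-(s - lo)) for s in scores.values()))
    reg = sum(x * x for row in U for x in row) + sum(x * x for row in V for x in row)
    return nll / len(obs) + 0.5 * lam * reg


def gradients(U, V, obs, lam):
    """Gradients of `objective` with respect to U and V."""
    gU = [[lam * x for x in row] for row in U]
    gV = [[lam * x for x in row] for row in V]
    T = len(obs)
    for i, jc, S in obs:
        scores = {j: dot(U[i], V[j]) for j in S}
        lo = min(scores.values())
        w = {j: math.exp(-(s - lo)) for j, s in scores.items()}
        W = sum(w.values())
        for j in S:
            # d(loss_t)/d(theta_ij) = 1[j is chosen] - P(j | S)
            coef = ((1.0 if j == jc else 0.0) - w[j] / W) / T
            for k in range(len(U[i])):
                gU[i][k] += coef * V[j][k]
                gV[j][k] += coef * U[i][k]
    return gU, gV


# Tiny illustrative instance: 2 types, 3 items, rank 2.
U = [[0.1, -0.2], [0.3, 0.4]]
V = [[0.5, 0.1], [-0.3, 0.2], [0.0, -0.4]]
obs = [(0, 1, [0, 1, 2]), (1, 0, [0, 2]), (0, 2, [1, 2])]
lam = 0.1
gU, gV = gradients(U, V, obs, lam)
```

Storage is proportional to $(m + n)\hat r$ plus the data, and a finite-difference comparison against `objective` is a quick way to validate the gradient code.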

We initialize FGD using a technique recommended by Bhojanapalli et al. [2015] 
which only requires access to gradients of the objective of (9). Using the SVD, write 
$-\nabla L_T(0) = U \operatorname{diag}(\sigma_1, \dots, \sigma_{\min(m,n)}) V^T$ and initialize 

$U^0 = \gamma^{-1/2}\, U_{:\hat r} \operatorname{diag}(\sqrt{\sigma_1}, \dots, \sqrt{\sigma_{\hat r}})$, 
$V^0 = \gamma^{-1/2}\, V_{:\hat r} \operatorname{diag}(\sqrt{\sigma_1}, \dots, \sqrt{\sigma_{\hat r}})$, 

where $\gamma = \|\nabla L_T(0) - (\nabla L_T(e_1 e_1^T) + \lambda e_1 e_1^T)\|_F$ and $U_{:\hat r}$, $V_{:\hat r}$ denote the first $\hat r$ columns 
of $U$, $V$. We use an adaptive step size with a line search that guarantees descent. 
Starting with a stepsize of $\eta = 1$, the stepsize is repeatedly decreased by a factor $\beta_{\mathrm{dec}}$ 
until the step produces a decrease in the objective. We terminate the algorithm when 
the decrease in the relative objective value is smaller than the convergence tolerance $\tau$. 










[Figure: line plot; legend entries n=m=100, n=m=150, n=m=500, n=m=750, n=m=1000, n=m=4000.] 

Fig. 2: RMSE of our estimators for $\Theta^\star$ by observations per row with $r = 2$. 


6. EXPERIMENTAL RESULTS 

In this section we study the problem experimentally to investigate the success of our 
algorithm. We compare our estimate $\hat\Theta$ with the standard maximum likelihood estimate 
$\hat\Theta^{\mathrm{MLE}}$ that solves 

minimize $L_T(\Theta)$, 
subject to $\Theta e = 0$.   (11) 

Note that since it imposes no structure on the whole matrix $\Theta$, problem (11) decomposes 
into $m$ subproblems, one for each type (row of $\hat\Theta^{\mathrm{MLE}}$), each solving a separate MNL 
MLE in $n$ variables. In our experiments, we use Newton’s method as implemented by 
Optim.jl to solve each of the subproblems, omitting the constraint $\Theta e = 0$ and projecting 
onto it at termination since $L_T(\cdot)$ is invariant to shifts in this subspace. 

To generate $\Theta^\star$, we fix $m$, $n$, $r$, let $\Theta_0$ be an $m \times n$ matrix composed of independent 
draws from a standard normal, take its SVD $\Theta_0 = U \operatorname{diag}(\sigma_1, \sigma_2, \dots) V^T$, truncate 
it past the top $r$ components $\Theta_1 = U \operatorname{diag}(\sigma_1, \dots, \sigma_r, 0, \dots) V^T$, and renormalize to 
achieve unit sample standard deviation to get $\Theta^\star$, i.e., $\Theta^\star = \Theta_1 / \operatorname{std}(\operatorname{vec}(\Theta_1))$. To generate 
the choice data, we let $i_t$ be drawn uniformly at random from $\{1, \dots, m\}$, $S_t$ be 
drawn uniformly at random from all subsets of size 10, and $j_t$ be chosen according to 
(1) with parameter $\Theta^\star$. 
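
A dependency-free sketch of such a data generator follows. It deviates from the procedure in the text in one respect, stated plainly: instead of truncating the SVD of a Gaussian matrix, it builds a matrix of rank at most $r$ as a product of two Gaussian factors, which avoids needing an SVD routine. It also assumes choice probabilities proportional to $e^{-\Theta_{ij}}$. The function names, sizes, and seed are illustrative.

```python
import math
import random


def low_rank_matrix(m, n, r, rng):
    """Rank-<=r matrix from Gaussian factors, rescaled to unit sample std.

    Substitute for the SVD truncation described in the text."""
    A = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(m)]
    B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(n)]
    theta = [[sum(a * b for a, b in zip(A[i], B[j])) for j in range(n)] for i in range(m)]
    flat = [x for row in theta for x in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((x - mean) ** 2 for x in flat) / (len(flat) - 1))
    return [[x / std for x in row] for row in theta]


def sample_observation(theta, K, rng):
    """Draw (type, chosen item, assortment) with zero-based indices."""
    m, n = len(theta), len(theta[0])
    i = rng.randrange(m)                         # type, uniform
    S = rng.sample(range(n), K)                  # assortment, uniform over K-subsets
    weights = [math.exp(-theta[i][j]) for j in S]
    j = rng.choices(S, weights=weights, k=1)[0]  # MNL choice within S
    return i, j, S


rng = random.Random(1)
theta_star = low_rank_matrix(m=20, n=30, r=2, rng=rng)
data = [sample_observation(theta_star, K=10, rng=rng) for _ in range(100)]
```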

For our estimator we use Algorithm 1 with $\hat r = 2r$, $\lambda$ proportional to $\sqrt{K d \log d / (mnT)}$, $\beta_{\mathrm{dec}} = 0.8$, and 
$\tau = 10^{-10}$. This regularizing coefficient scales with $m$, $n$, $d$, $T$, and $K$ as suggested 
by Theorem 3.1, but we find the algorithm performs better in practice when we use a 
smaller constant than that suggested by the theorem. 

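We plot the results in Figure 1, where error is measured in root mean squared error 
(RMSE) 

$\mathrm{RMSE}(\hat\Theta) = \sqrt{\operatorname{Avg}(\{(\hat\Theta_{ij} - \Theta^\star_{ij})^2\}_{i,j})} = \frac{1}{\sqrt{mn}} \|\hat\Theta - \Theta^\star\|_F$. 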
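
For concreteness, the error metric can be computed as in the following minimal sketch, which represents matrices as nested lists; the function name `rmse` is a hypothetical helper.

```python
import math


def rmse(theta_hat, theta_star):
    """Root mean squared entrywise error, i.e. the Frobenius norm divided by sqrt(m*n)."""
    m, n = len(theta_star), len(theta_star[0])
    sq = sum((theta_hat[i][j] - theta_star[i][j]) ** 2 for i in range(m) for j in range(n))
    return math.sqrt(sq / (m * n))


# Two entries differ, by 1 and by 2, out of four entries in total.
example = rmse([[1.0, 2.0], [3.0, 4.0]], [[0.0, 2.0], [3.0, 2.0]])
```

Because the ground-truth matrices in the experiments are scaled to unit sample standard deviation, an RMSE near 1 corresponds to an estimate no better than a constant.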

The results show the advantage in efficient use of the data offered by our approach. The 
results also show that, relative to MLE, the advantage is greatest when the underlying 
rank $r$ is small and the number of parameters $m \times n$ is large, but that we maintain a 
significant advantage even for moderate $r$ and $m \times n$. For large numbers of parameters 
($m = n > 750$), the RMSE of MLE is very large and does not appear in the plots. Only 
in the case of greatest rank ($r = 25$), smallest number of parameters ($m = n = 100$), 
and greatest number of observations (T = 10®) does MLE appear to somewhat catch 
up with our estimator. 

In Figure 2, we plot the RMSE of our estimator against the number of observations 
per type (or item) T/d for a square problem with d = m = n. We see nearly the same 
error curve traced out as we vary the problem size d. This scaling shows that our 
estimator is able to leverage the low-rank assumption and require the same number 
of choice observations per type to achieve the same RMSE regardless of problem size. 


7. CONCLUSION 

This paper proposes a new model for assortment choices, the low rank MMNL model, 
in which the preferences of each type follow a parametric multinomial logit distribution, 
but share a latent linear structure. We show that this preference structure can be 
efficiently learned both in theory and in practice through a bound on the sample complexity 
and through numerical simulations. The optimization problem we propose admits 
a fast algorithm that scales linearly in the number of types and products. The low 
rank MMNL approach can make learning choice models from choice data significantly 
more efficient compared with standard methods, and thereby enables a fine-grained 
understanding of preferences in a diverse population. 


8. PROOFS OF PRELIMINARY LEMMAS 

Proof (Lemma 4.1). Let $\Theta e = 0$. Then because $S_t$ is permutation symmetric we 
get that 
Letting $e_S \in \mathbb{R}^n$ be the indicator vector of the set $S$, 
Note $\|\cdot\|_2^2$ is $2\gamma k$-Lipschitz with respect to the $\infty$-norm on a domain in $[-\gamma, \gamma]^n$ where only 
k entries are nonzero. Therefore, by Lemma 7 of Bertsimas and Kallus [2014] and by 

Holder’s inequality, letting Wt = J2jeSt “ ih ^j'eSt where etj are iid 
Note $\|W_t\|_2 \le \sqrt{K_t} \le \sqrt{K}$. Moreover, 


Since S't|R't is uniform. 
by iterated expectation and Jensen’s inequality. The matrix Bernstein inequality 


[Tropp 2012, Thm. 1.6] gives that I ^ J2^=i m > <5 with probability at most 
Setting the probability to 1 and using $T \le d^2 \log d$, 
Putting it all together, we get 
Next we use this to prove the concentration of $M_{\Gamma,\nu}$. Let $M'_{\Gamma,\nu}$ be a replicate of $M_{\Gamma,\nu}$ 
with $i'_t = i_t$, $S'_t = S_t$ for all $t$ except $t'$. The difference $M_{\Gamma,\nu} - M'_{\Gamma,\nu}$ is bounded by 
Hence, by McDiarmid’s inequality, we have 

IP — EA^r,i/ > S) < e . 

Using <5 = ^ ^ and 'EMt,v < (i/2 we get the result. □ 

Proof (Lemma 4.2). Since || • ||* > || • ||f, we have inf©g^. ||6 ||f > t := 
\2E\/Kmn-y^Jd\ogd/T. Let Ai = yl* n {\/2^ ^ < ||6 ||f < \/2*} and note that A* = 
U^i Ai and Ai C -dy 2 L.i/ 4 - Moreover, if 0 e yl; has ^ J2f=i ^t(O) < ^ ||0|If then 
Therefore, with $p$ denoting the probability in the statement of the theorem to be 
bounded, 
using Lemma 4.1, $T \le d^2 \log d$, and $K \ge 2$. □ 

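Proof (Lemma 4.3). Let $R_t(\Theta) = e_{i_t} \big( e_{j_t} - \sum_{j \in S_t} p_{tj}(\Theta)\, e_j \big)^T$, where $p_{tj}(\Theta) = 
e^{-\Theta_{i_t j}} / \sum_{j' \in S_t} e^{-\Theta_{i_t j'}}$. 
Then $\nabla L_T(\Theta) = \frac{1}{T} \sum_{t=1}^T R_t(\Theta)$. Note that because $j_t$ is drawn according to $\Theta^\star$, we have 
that $\mathbb{E}[R_t(\Theta^\star) \mid i_t, S_t] = 0$ and hence $\mathbb{E} R_t(\Theta^\star) = 0$. Let $R_t = R_t(\Theta^\star)$, $p_{tj} = p_{tj}(\Theta^\star)$. Note 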
that ||i7t||2 < V2- Moreover, letting k* = ??t/, 
Since by Jensen’s inequality the multiplier in the parentheses is no greater than 2, we 

get I |E [RtRf] 1 12 < ^1 < ^- Letting ytj =l[j = jt], we have 
By the matrix Bernstein inequality [Tropp 2012, Thm. 1.6], $\big\| \frac{1}{T} \sum_{t=1}^T R_t \big\|_2 \ge \delta$ with probability 
at most 